Abstract 

Balancing computational efficiency with recognition accuracy is one of the major challenges in real-world 
video-based face recognition. A significant design decision for any such system is whether to process and 
use all possible faces detected over the video frames, or whether to select only a few 'best' faces. This paper 
presents a video face recognition system based on probabilistic Multi-Region Histograms to characterise 
performance trade-offs in: (i) selecting a subset of faces compared to using all faces, and (ii) combining 
information from all faces via clustering. Three face selection metrics are evaluated for choosing a subset: 
face detection confidence, random subset, and sequential selection. Experiments on the recently introduced 
MOBIO dataset indicate that the usage of all faces through clustering always outperformed selecting only a 
subset of faces. The experiments also show that the face selection metric based on face detection confidence 
generally provides better recognition performance than random or sequential sampling. Moreover, the 
optimal number of faces varies drastically across selection metrics and subsets of MOBIO. Given the 
trade-offs between computational effort, recognition accuracy and robustness, we recommend face 
feature clustering for batch processing (particularly for video-based watchlists), 
whereas face selection methods should be limited to applications with significant computational restrictions. 
1 Introduction 

While there has been a substantial amount of re- 
search in still image face recognition, there has 
been comparatively less on video face recognition. 
Video typically provides much more information 
for recognition compared to still images, includ- 
ing temporal and multiview information. However, 
one of the major challenges in video is to decide 
how to maximise the usage of available information 
while ensuring the system can still run in a scalable 
and timely manner. For example, despite typically 
having many frames of face information available 
from video, one of the design decisions for any face 
recognition system includes how many faces to use 
and, if not all, how to select them. There is po- 
tentially a trade-off between computational effort 
and recognition accuracy, which can be influenced 
by the number of faces used. 

Additionally, we are interested in addressing real- 
world video recognition problems where the envir- 
onment is uncontrolled and subjects may not be 
actively cooperating with the camera. Further- 
more, the quality of images can vary quite dra- 
matically. For instance, in surveillance contexts, 
CCTV video suffers from low quality, resolution 
mismatches, varying pose and lighting from camera 
to camera, and also within the same camera 
depending on time of day (changes in lighting and 
shadows). Another example of an uncontrolled 
environment is handheld mobile video, which often 
suffers from quality issues such as lens smudging, 
blur, pose and lighting changes due to variation 
between scenes (indoor/outdoor). In addition to 
the above image variations, face detection and align- 
ment will also have great influence on the recogni- 
tion performance [1]. Many face recognition al- 
gorithms assume the faces are well aligned and 
normalised, which may not be the case, especially 
for low quality video. Thus to address these issues, 
not only does the face recognition system need to 
be scalable and efficient, but it also has to be robust 
to common issues that affect recognition accuracy. 

This paper describes a system for video-to-video 
face recognition which uses an adapted form of 
the probabilistic Multi-Region Histogram (MRH) 
method originally developed for still-to-still face 
recognition [2]. We have chosen to extend it to 
video-to-video recognition as it has shown robust- 
ness to alignment errors as well as variations in il- 
lumination, pose and image quality. Furthermore, 
MRH is relatively computationally efficient, mak- 
ing it suitable as a starting point for developing a 
scalable video-to-video recognition system. 



Within the video-based system, we characterise the 
impact on recognition accuracy when using various 
methods of face selection to choose only a subset of 
faces for recognition. We also contrast those meth- 
ods with the alternative of using information from 
all faces through feature clustering. We examine 
the trade-offs between computational effort and 
recognition performance present in face selection 
and clustering, and suggest situations where the 
two approaches might be best utilised. 

The paper proceeds as follows: Sections 2 and 3 
provide background on face selection and feature 
clustering in video; Section 4 describes our video- 
based face recognition framework; Section 5 dis- 
cusses the experiments on the MOBIO dataset to 
compare the different approaches in face selection, 
and contrasts that to the utilisation of all faces 
through feature clustering. Conclusions and direc- 
tions for future work are given in Section 6. 

2 Background: Face Selection 

While there have been several surveys on video-based 
face recognition [3, 4, 5, 6], face selection 
was not reviewed as a component of existing 
face recognition systems until recently (2010) [6]. 

Face selection has become a more prominent topic 
as larger video datasets are made available, 
including the Mobile Biometry (MOBIO) 
dataset [7]. MOBIO has 17,480 
videos and over 3 million frames; with such a 
large amount of information, balancing computational 
efficiency with recognition performance becomes 
essential. 

Shan [6] calls the approach of independently using 
all or a subset of face images with a still image- 
based recognition method the 'key-frame (or ex- 
emplar) based approach'. In most cases ad-hoc 
heuristics are used to select key-frames. A common 
way of selecting a subset of faces is through a met- 
ric based on face detection confidence after the face 
detection step [7]. Face confidence metrics can be 
based on located facial features (such as eyes and 
nose) within the face [8], or face classification using 
pre-trained binary classifiers [9, 10]. The number 
of selected faces is typically chosen in a heuristic 
manner, such as a fixed number of faces or all faces above 
a certain confidence threshold. 

There are typically two main reasons for not using 
all faces: the first is computational effort due to 
the size of the dataset, the second is that the mar- 
ginal gain in recognition accuracy decreases after 
a certain number of faces [7]. We will discuss the 
computational effort trade-offs in Section 5.3, and 
our experiments in Section 5.1 will analyse the 
second reason in more detail. 



3 Background: Face Clustering 

In the cases where computation time is less of a 
limitation, such as offline or batch processing, there 
is potential to utilise information from all faces in 
a video. However, using all faces for recognition 
must still be done in a tractable manner, either 
in the recognition step itself or in a pre-processing 
step. We propose to use facial feature clustering as 
such a pre-processing step. 

Historically, video face recognition methods originate 
from still-image based techniques, which are 
applied over the multiple face frames by treating 
each as a still image [5] and modifying the distance 
calculation to accommodate multiple identification 
hypotheses. These approaches are classified by 
Matta and Dugelay [3] as approaches that neglect 
temporal information. This class of video face re- 
cognition makes up the majority of the face recog- 
nition systems published [5]. They include video 
extensions of PCA, LDA, Active Appearance Mod- 
els and Elastic Graph Matching. The major draw- 
back to these approaches is that they may become 
computationally intractable to store and search for 
any significant amount of video. They also do not 
take advantage of the fact that sequential faces 
may be very similar and thus may be grouped to- 
gether to reduce redundancy. 

Temporal model and image-set matching 
approaches address this issue by modeling the distribution 
of face images over time or by features [6]. 
These approaches tend to integrate the information 
expressed by all the face images into a single model. 

One such image-set matching solution is to cluster 
similar faces by feature similarity. Lee et al. [11] 
proposed to learn a low-dimensional manifold, which 
is approximated by piecewise linear subspaces. To 
construct the representation, exemplars are first 
sampled from videos by finding frames with the 
largest distance to each other corresponding to head 
pose changes in video, which are further clustered 
using K-means clustering. Each cluster models 
face appearance in nearby poses, represented by 
a linear subspace computed by PCA. Arandjelovic 
et al. [12] model the face appearance distribution as 
Gaussian Mixture Models (GMMs) on low- 
dimensional manifolds. In further work [13], they 
derived a local manifold illumination invariant, and 
formulated the face appearance distribution as a 
collection of Gaussian distributions corresponding 
to clusters obtained by k-means. 

We propose a similar approach of clustering face 
features as a collection of Gaussians, as a precursor 
to face recognition for any system. This will be 
demonstrated using Multi-Region Histogram features 
rather than the previously used local manifolds. 



4 Video-to-Video Matching 

A generic face recognition system has the compon- 
ents of face detection, feature extraction, and face 
matching [5]. The two system components being 
proposed and analysed in this paper, face selection 
and facial feature clustering, fall in between the 
face detection and recognition steps, as illustrated 
in Fig. 1. The face recognition system presented 
here uses OpenCV for face detection in conjunc- 
tion with a modified form of MRH [2] for feature 
extraction. Details of the system components are 
given below. 

4.1 Face Localisation 

For face localisation, OpenCV's Haar Feature-based 
Cascade Classifier [14] is used to detect and local- 
ise faces in each frame. Eyes are located within 
each face using a Haar-based classifier. If no eyes 
are found, their locations are approximated based 
on the size of the localised face. The faces are 
then resized and cropped such that the eyes are 
at predefined locations with a 32-pixel inter-eye 
distance. The final face is a closely cropped 'inner' 
face of size 64x64 pixels (as later seen in Fig. 3), 
which attempts to exclude image areas susceptible 
to disguises, such as the hair and chin. 

4.2 Face Selection 

One approach for face selection is based on a metric 
of face detection confidence, which is the confid- 
ence of a face classifier that the region of interest 
is a face. The implementation of this metric varies. 
A generic method is to apply a post-processing step 
of a face or non-face binary classifier for all faces 
detected to obtain a confidence measure [9]. We 
compare the face confidence method of [7], which 
is based on where landmarks (such as the eyes and nose) 
are detected within the face, to the more naive 
methods of random and sequential selection. 

Given any video $V_i$ for person $i$, the number of 
faces extracted from the video is $N_i$. Faces $l_j$ from 
video $V_i$ are sorted chronologically and indexed by 
$j$. We can then select $m$ faces from $V_i$ to form a 
face set $S = \{l_{q_1}, l_{q_2}, \ldots, l_{q_m}\}$. For random selection, 
$q_k = \mathrm{rand}(N_i)$, $k \in [1, m]$, where $\mathrm{rand}(N_i)$ 
generates a unique random number between 1 and 
$N_i$. For sequential selection, we select the first $m$ 
faces from $V_i$, that is $q_1 = 1, q_2 = 2, \ldots, q_m = m$. 
For confidence selection, each face $l_j$ is processed 
by the face detector to get the confidence of the 
detection $C_j$. The top $m$ faces with the highest 
confidence are selected. 
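
The three selection rules can be illustrated with a minimal Python sketch. The function name select_faces, the seed argument and the use of 0-based indices are assumptions made for illustration only; the confidences are assumed to be supplied by whichever face detector is in use.

```python
import random


def select_faces(confidences, m, method, seed=0):
    """Return the 0-based indices of the m faces chosen from one video.

    confidences: one detection confidence per face, in chronological order.
    method: "sequential", "random" or "confidence".
    """
    n = len(confidences)
    m = min(m, n)  # cannot select more faces than were detected
    if method == "sequential":
        # The first m faces in chronological order.
        return list(range(m))
    if method == "random":
        # m distinct indices drawn without replacement.
        rng = random.Random(seed)
        return sorted(rng.sample(range(n), m))
    if method == "confidence":
        # The m faces with the highest detection confidence.
        ranked = sorted(range(n), key=lambda j: confidences[j], reverse=True)
        return sorted(ranked[:m])
    raise ValueError("unknown method: " + method)
```

Sequential and random selection need no per-face processing, whereas confidence selection requires a confidence value for every detected face before any face can be discarded.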

4.3 Feature Extraction using MRH 

The MRH approach is motivated by the concept 
of 'visual words' (originally used in image categorisation 
[15]) as well as the semi-loose spatial constraints 
between face parts in 2D Hidden Markov 
Models [2]. It can be briefly described as follows. 
A given face is divided into several fixed and ad- 
jacent regions (e.g. 3x3) that are further divided 
into small overlapping blocks (with a size of 8x8 
pixels). For region $r$ a set of low-dimensional feature 
vectors is obtained from the blocks in that 
region, $F_r = \{f_{r,i}\}_{i=1}^{N}$. Each block is normalised to 
have zero mean and unit variance, and descriptive 
features are extracted from each block via 2D DCT 
decomposition [16]. Each feature vector $f_{r,i}$ obtained 
from region $r$ is then represented as a high-dimensional 
probabilistic histogram: 
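
$$h_{r,i} = \left[ \frac{w_1\, p_1(f_{r,i})}{\sum_{g=1}^{G} w_g\, p_g(f_{r,i})}, \; \ldots, \; \frac{w_G\, p_G(f_{r,i})}{\sum_{g=1}^{G} w_g\, p_g(f_{r,i})} \right]^{T} \qquad (1)$$

where the $g$-th element in $h_{r,i}$ is the posterior probability 
of $f_{r,i}$ according to the $g$-th of the $G$ components of a 
'visual dictionary' model, with an associated weight 
of $w_g$ and likelihood $p_g(\cdot)$. The dictionary is a Gaussian Mixture Model 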
with 1024 components, built from low-dimensional 
2D DCT features extracted from training faces. 
The mean of each Gaussian in the dictionary can 
be thought of as a particular 'visual word'. Ro- 
bustness to face misalignment is achieved by rep- 
resenting each region as one average histogram: 

$$h_{r,\mathrm{avg}} = \frac{1}{N} \sum_{i=1}^{N} h_{r,i} \qquad (2)$$

For faces with a size of 64x64 pixels, there are 9 
regions arranged in a 3x3 layout. This results in 
an MRH signature composed of 9 histograms, with 
each histogram having 1024 components: 

$$\mathrm{MRH} = [\, h_{1,\mathrm{avg}},\; h_{2,\mathrm{avg}},\; \ldots,\; h_{9,\mathrm{avg}} \,] \qquad (3)$$

One MRH signature is used to represent each face 
in each frame. 
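
As an illustration of how a signature is assembled, the following Python sketch computes a posterior histogram per block feature and averages the histograms within each region. It assumes that the low-dimensional DCT features have already been extracted and that the dictionary uses diagonal covariances; the diagonal-covariance assumption and all function names are illustrative choices, not details confirmed by the description above.

```python
import math


def log_gauss_diag(f, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector f."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v
        for x, mu, v in zip(f, mean, var)
    )


def posterior_histogram(f, weights, means, variances):
    """Posterior probability of each dictionary component given feature f."""
    logs = [
        math.log(w) + log_gauss_diag(f, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the maximum before exponentiating, for stability
    unnorm = [math.exp(value - peak) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def region_histogram(features, weights, means, variances):
    """Average the per-block histograms of one region."""
    hists = [posterior_histogram(f, weights, means, variances) for f in features]
    n = len(hists)
    return [sum(column) / n for column in zip(*hists)]


def mrh_signature(regions, weights, means, variances):
    """One averaged histogram per region (9 regions for a 3x3 layout)."""
    return [region_histogram(fs, weights, means, variances) for fs in regions]
```

Because each per-block histogram sums to one, every averaged region histogram also sums to one, regardless of how many blocks the region contains.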

4.4 Feature Clustering 

We choose the widely known k-means algorithm to 
group a set of faces into k clusters and represent 
each cluster by its centroid [17, 18]. We adapt it 
to videos and MRH face signatures 
by seeding the k clusters with faces spaced at regular 
intervals within a video; the distance metric 
used during the clustering process is described in 
Eqn. (4). 

Once the k MRH clusters have been generated, the 
average MRH of each cluster's signatures is used 
as the representative signature. The special case 
of k = 1 is just an average MRH signature over all 
available faces. In the experiments we also apply 
clustering on faces from multiple videos belonging 
to the same person. 
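
The clustering step can be sketched in Python as follows. This is a minimal illustration rather than the exact implementation: each signature is assumed to be flattened into a single vector, the seed positions (i * n) // k are one possible reading of 'regular intervals', and the function names are hypothetical. The cap of 20 iterations follows the limit stated in the caption of Table 1.

```python
def l1_distance(x, y):
    """L1-norm distance between two flattened signatures."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mean_signature(signatures):
    """Element-wise average of a list of signatures."""
    n = len(signatures)
    return [sum(column) / n for column in zip(*signatures)]


def cluster_signatures(signatures, k, max_iter=20):
    """Cluster chronologically ordered signatures; return the centroids."""
    n = len(signatures)
    k = min(k, n)
    # Seed the clusters with faces spaced at regular intervals in the video.
    centroids = [list(signatures[(i * n) // k]) for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: l1_distance(s, centroids[c]))
            for s in signatures
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids have converged
        assignment = new_assignment
        for c in range(k):
            members = [s for s, a in zip(signatures, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = mean_signature(members)
    return centroids
```

With k = 1 the function returns the average signature over all available faces, matching the special case described above.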
Figure 1: A recognition system with the proposed face selection and feature clustering steps highlighted. 



4.5 MRH Signature Comparison 

Two MRH signatures, $X$ and $Y$, are compared 
using an $L_1$-norm based distance measure: 

$$d_{\mathrm{raw}}(X, Y) = \| X - Y \|_1 \qquad (4)$$

A decision on whether $X$ and $Y$ represent the same 
person (i.e., matched pair) or two different persons 
(i.e., mismatched pair) can be obtained by 
comparing $d_{\mathrm{raw}}(X, Y)$ to a threshold. However, 
in order to provide further robustness to varying 
image conditions present in $X$ and $Y$, a normalised 
distance can be obtained by adapting the cohort 
normalisation approach originally used in speech 
processing [2, 19]: 

$$d_{\mathrm{norm}}(X, Y) = \frac{d_{\mathrm{raw}}(X, Y)}{\frac{1}{2M} \sum_{i=1}^{M} \left\{ d_{\mathrm{raw}}(X, C_i) + d_{\mathrm{raw}}(Y, C_i) \right\}} \qquad (5)$$

Here, $C_i$ is the $i$-th cohort face and $M$ is the number 
of cohorts, with the cohort faces taken from the 
training set. 

For probes and galleries with multiple MRH signatures, 
each of the probe's $K_p$ MRH signatures is 
individually compared to each of the gallery's $K_g$ 
MRH signatures, resulting in $K_p \times K_g$ distances. 
The lower the distance, the more similar two sig- 
natures are, thus the minimum distance is taken 
as the final distance between a probe-gallery video 
pair. The minimum distance is then compared to 
a threshold to obtain the final match/mismatch 
decision. 
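
The matching step can be sketched in Python as below. The sketch assumes that the normalised distance is the raw L1 distance divided by the average raw distance of the two signatures to the cohort signatures, and that signatures are flattened vectors; the function names are illustrative.

```python
def l1_distance(x, y):
    """Raw L1-norm distance between two flattened signatures."""
    return sum(abs(a - b) for a, b in zip(x, y))


def normalised_distance(x, y, cohorts):
    """Raw distance divided by the mean distance of x and y to the cohorts."""
    m = len(cohorts)
    cohort_term = sum(
        l1_distance(x, c) + l1_distance(y, c) for c in cohorts
    ) / (2.0 * m)
    return l1_distance(x, y) / cohort_term


def video_distance(probe_sigs, gallery_sigs, cohorts):
    """Minimum normalised distance over all probe-gallery signature pairs."""
    return min(
        normalised_distance(p, g, cohorts)
        for p in probe_sigs
        for g in gallery_sigs
    )
```

Taking the minimum means that a single well-matched pair of signatures is enough to produce a small distance for the whole probe-gallery video pair, at the cost of evaluating every pair.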

An appropriate threshold can be determined using 
a labelled set by looking for the value which 
results in the minimum number of false negatives 
(matching probe and gallery identities with a distance 
greater than the threshold) and false positives 
(non-matching probe and gallery identities 
with a distance less than the threshold). This is 
also referred to as the minimum error rate and is used in 
the experiments. 

5 Experiments and Discussion 

In our experiments we used the large-scale 'Mobile 
Biometry' (MOBIO) dataset, which has been created 
as part of a European project focusing on biometric 
person recognition from portable devices [7]. 
The dataset is split into three distinct sets: one for 
training, one for development and one for testing. 
No persons are shared across any of the three sets. 



The protocol for enrolling and testing is the same 
for the development set and the test set. There 
are five enrolment videos for each user and 75 test 
client (positive sample) videos for each user (15 
from each session). When producing impostor 
scores, all the other clients are used; for instance, 
if there were 50 clients in total, then the other 49 
clients would each perform an impostor attack. 

For the development set, there are 20 female and 
27 male users, which results in 30,000 probe to 
user comparisons for females and 54,675 for males. 
For the test set, there are 22 female and 39 male 
users, resulting in 36,300 comparisons for females 
and 114,075 for males. 
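
The comparison counts follow from the protocol if every probe video is compared against every enrolled user of the same subset (one genuine comparison and the remainder impostor attempts). This reading is an assumption, but it reproduces all four figures, as the short Python check below shows; the function name is illustrative.

```python
def num_comparisons(num_users, probes_per_user=75):
    """Each probe video is scored against every enrolled user in the subset."""
    num_probes = num_users * probes_per_user
    return num_probes * num_users


# Development set: 20 female and 27 male users.
dev_female = num_comparisons(20)
dev_male = num_comparisons(27)
# Test set: 22 female and 39 male users.
test_female = num_comparisons(22)
test_male = num_comparisons(39)
```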

The MOBIO experiment protocol involves evaluat- 
ing a face recognition system on the development 
and test subsets for males and females. In this 
paper, we present the results of the four subsets 
with minimum error rate (MER), given by: 

$$\mathrm{MER} = \min_{t} \frac{1}{2} \left( \mathrm{FAR}_t + \mathrm{FRR}_t \right) \qquad (6)$$

where $\mathrm{FAR}_t$ and $\mathrm{FRR}_t$ are the false acceptance 
rate and false rejection rate obtained at threshold $t$. 
An equal weighting was chosen for FAR and FRR 
to remain application neutral. MER is a variant of 
the equal error rate (EER) [19], but is considered 
to be more reliable as it does not make any as- 
sumptions about the shape of the FAR and FRR 
curves. 
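
A minimal Python sketch of the MER computation is given below. It assumes that scores are distances, that a pair is accepted when its distance is below the threshold, and that the error is the equally weighted average of FAR and FRR; the function name and the choice of candidate thresholds are illustrative.

```python
def minimum_error_rate(genuine, impostor):
    """Return (MER, threshold) for lists of genuine and impostor distances.

    A pair is accepted when its distance is strictly below the threshold.
    """
    candidates = sorted(set(genuine) | set(impostor))
    candidates.append(candidates[-1] + 1.0)  # a threshold accepting every pair
    best = None
    for t in candidates:
        # Genuine pairs at or above the threshold are falsely rejected.
        frr = sum(d >= t for d in genuine) / len(genuine)
        # Impostor pairs below the threshold are falsely accepted.
        far = sum(d < t for d in impostor) / len(impostor)
        err = 0.5 * (far + frr)
        if best is None or err < best[0]:
            best = (err, t)
    return best
```

Only the observed distances need to be tried as thresholds, since FAR and FRR can change only at those values.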

5.1 Face Selection 

In the face selection approach, a subset of faces 
is chosen for recognition based on a particular 
selection metric. The aim is to characterise whether different 
metrics can improve recognition performance, and 
if so, by how much. 

Our first experiment compares the recognition ac- 
curacy across the following three selection meth- 
ods, on an increasing number of faces selected: 
(i) face detection confidence, (ii) random selection, 
and (iii) sequential selection. After the faces are selected, 
the average MRH signature over all selected 
faces is used for recognition. The MER results are 
presented in column (a) of Fig. 2. The following 
three main observations can be made: 

1. Using multiple faces always performs better than 
using only one face, but using all faces does not 
guarantee the best performance. This implies 
that average MRH signatures are generally a 
good representation of the varied samples, as 
the performance after averaging is always better 
than that of a single face. 

2. The face selection method itself affects the re- 
cognition rate drastically. Random selection seems 
to provide slightly better performance in terms 
of minimal error rate for recognition when com- 
pared to sequential sampling of faces. The reason 
from an information point of view is that se- 
quential faces are very likely to have much less 
variation compared to faces sampled randomly 
throughout the video. Face confidence, the most 
computationally expensive one tested, typically 
gives better performance overall compared to 
the other two metrics. The reason might be 
due to better alignment (i.e., a more frontally 
aligned face) as the confidence is related to how 
well facial landmarks are located within the face. 

3. The optimal number of faces (the number which 
gives the lowest error) varies drastically across 
face selection methods as well as the MOBIO 
subsets. Typically, training data is used for set- 
ting parameters such as the number of faces to 
use, and is assumed to have similar characterist- 
ics as test data. We can see that even within the 
same dataset such as MOBIO, this assumption 
does not hold true. This highlights the fragility 
of face selection: the application depends 
on heuristic parameters (i.e., number of faces or 
threshold of confidence) which are very dependent 
on the data and method. As such, face 
selection is not likely to translate well across 
various datasets. 

5.2 Face Feature Clustering 

In the cases where computation time is less of a 
limitation, such as offline or batch processing, there 
is potential to utilise information from all faces in 
a video. We propose clustering of facial features 
as a precursor to face recognition to make the 
face matching stage computationally tractable and 
more memory efficient (only the cluster centroids 
are stored and compared). In the MRH framework, 
clustering also takes advantage of the observation 
made in Section 5.1, where higher recognition ac- 
curacy was achieved by using multiple faces rather 
than a single face. This suggests average MRH 
signatures (such as a centroid of an MRH cluster) 
would provide better signatures for recognition. 

The experiments for k-means clustering were done 
using single video (where faces from just one video 
of a person are clustered) and multiple videos (where 
faces from all videos of the same person are clustered). 
The recognition results for both cases using vary- 
ing k are presented in column (b) of Fig. 2. The 
following three main observations can be made: 



1. For every subset, the optimal k was greater than 
1. As an example, the female development sub- 
set seems to give the best results for clustering 
at k = 2. Fig. 3 shows a few images from a 
video in that subset for two clusters. As can 
be observed in Fig. 3, clustering yielded visually 
discernible differences between the images. 
Cluster 2 has more closely cropped faces (with 
borders cutting off the edges of the face), whereas 
Cluster 1 shows more of the chin, hair and a bit 
of background. This is reflective of face align- 
ment errors due to inaccuracies in eye localisa- 
tion in the first stage of the video-based recogni- 
tion system, and indicates that clustering may 
be a good way to minimise errors introduced in 
earlier stages of the system. 

2. The clustering of faces from multiple videos was 
nearly always better (in terms of finding the 
overall minimum error rate for a subset) than 
clustering of faces from a single video alone. 
In MOBIO, each client gallery consisted of five 
separate videos. Based on the clustering results, 
these videos seem to have been recorded in similar 
environments. Because of this similarity, 
clustering by gallery likely performed better 
as each cluster contains more samples, 
yielding more robust MRH signatures. 

3. The optimal k varied depending on the MOBIO 
subset. However, unlike face selection where 
heuristics are used to select the optimal number 
of faces, there is extensive literature on cluster- 
ing methods which find the 'natural clusters' to 
fit the data [17]. Thus clustering can be more 
robust in terms of maintaining optimal recog- 
nition accuracy for a video recognition system 
across many different datasets. 

5.3 Comparing Face Selection & Clustering 

While face selection and feature clustering are not 
mutually exclusive components in a video-based 
recognition system, both can separately contribute 
to reducing the computational effort to different 
degrees: face selection reduces the computational 
requirements for the subsequent steps of feature extraction 
and matching (distance calculation), while 
feature clustering reduces the computational requirements 
for the matching step only. 

The computation times for each step of the face 
recognition process are provided in Table 1. Feature 
extraction is the most computationally expensive 
part, taking 0.390 seconds per face. The time taken 
scales linearly with the number of faces selected 
or desired (N). The calculation of the normal- 
ised distance for a pair of MRH signatures takes 
approximately 0.002 seconds. However, if no clus- 
tering is performed and the distance is calculated 
Figure 2: MER obtained on MOBIO using: [a] three frame selection methods, and [b] feature clustering across 
single and multiple videos. The four subsets of MOBIO were evaluated separately: male development (MD), 
female development (FD), male test (MT), female test (FT). Observations for frame selection: (i) multiple frames 
outperform one frame, (ii) confidence-based selection generally gives the lowest error, (iii) the number of frames 
for lowest error varies across subsets and selection metrics. Observations for feature clustering: (i) the optimal 
k is greater than 1, (ii) clustering of faces from multiple videos is generally better than clustering of faces for 
each video independently, (iii) the optimal k varies depending on subset and method. Comparing face selection 
to clustering (row-by-row), clustering always gives the lowest overall error. 




Figure 3: Results from clustering of faces from a video in the female development set, using k = 2. 
Clustering yielded visually observable differences of two distinct face sizes resulting from inaccuracies in earlier 
eye localisation. 



naively on a pairwise basis per face per video, the 
distance time scales with $N^2$ and can quickly exceed 
the computation time for feature extraction. 

For the experimental system (Table 1), the number 
of faces ($N$) at which the naive matching exceeds 
the feature extraction time can be found by solving 
$0.390 N = 0.002 N^2$, which gives $N = 195$. On 
the other hand, when clustering is used, the distance 
time scales with the number of clusters ($K$) squared, 
where $K$ will always be less than or equal to $N$. 
This demonstrates how computationally inefficient 
it is to not use clustering or some other modeling 
method for reduction of signature sets. 
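
The crossover point can be checked with a few lines of Python, using the per-face and per-pair timings quoted above as defaults; the function names are illustrative.

```python
def extraction_time(n, per_face=0.390):
    """Feature extraction time (seconds) for n faces; linear in n."""
    return per_face * n


def naive_matching_time(n, per_pair=0.002):
    """Pairwise matching time (seconds) for n faces per video; quadratic in n."""
    return per_pair * n * n


def clustered_matching_time(k, per_pair=0.002):
    """Matching time (seconds) when each video is reduced to k centroids."""
    return per_pair * k * k


def crossover_faces(per_face=0.390, per_pair=0.002):
    """Number of faces at which naive matching equals feature extraction."""
    return per_face / per_pair
```

For example, at 200 faces per video, feature extraction takes about 78 seconds while naive pairwise matching takes about 80 seconds; with each video reduced to 8 centroids, matching a video pair takes about 0.128 seconds.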

In terms of performance for face recognition ac- 
curacy, Table 2 shows that utilising information 
from all faces through clustering consistently shows 
better accuracy than using a subset of faces. 



As separately observed both in the face selection 
and feature clustering experiments, the optimal 
value (i.e., number of faces or clusters) varied depending 
on the dataset and method. It was noted 
that for face selection, since the thresholds are chosen 
heuristically, this approach is particularly fragile to 
variations between datasets, which would lead to 
suboptimal performance. For feature clustering, 
variations in the optimal k are less of an issue 
as there is extensive work in finding the 'natural 
clusters' to fit the data [17]. Thus clustering is 
a more robust and reliable means of consistently 
boosting face recognition accuracy. 

The above trade-offs between computation and ac- 
curacy are interesting to characterise as they aid 
in determining which approach is most suitable 
for particular applications. For example, face se- 



Face Selection Method     Time (sec) per 200 faces 
Sequential/Temporal       < 0.001 
Random                    < 0.001 
Confidence                16.250 
None (all faces)          < 0.001 

Number of Clusters (K)    Time (sec) per 200 faces 
1                         0.003 
2                         0.027 
4                         0.037 
6                         0.060 
8                         0.125 


Table 1: Time taken on an Intel Xeon CPU @ 3.0 GHz system, using Linux 2.6.24 and GCC 4.2.4. Face selection 
time scales linearly to the number of original input faces. For the N selected faces, clustering time scales linearly 
to NK where K is number of clusters (maximum number of iterations is limited to 20). 





                      Best Face Selection            Best Feature Clustering 
Subset                Method       Faces   MER       Method      k    MER 
Male dev subset       any          all     22.41     multiple    4    21.12 
Female dev subset     face conf.   4       18.65     multiple    2    18.56 
Male test subset      random       16      18.88     single      8    18.30 
Female test subset    face conf.   16      15.25     multiple    4    14.36 




Table 2: The lowest Minimum Error Rate (MER %) for face selection and feature clustering on MOBIO. 
Under face selection methods, 'any' refers to any of the three methods tested, 'face conf.' refers to face detection 
confidence. Under feature clustering methods, 'single' refers to clustering faces within each video, 'multiple' refers 
to clustering of faces from multiple videos within a person's gallery. The lowest MER for each MOBIO subset is 
always obtained through clustering. 



lection is more suitable for systems which have 
real-time requirements (such as live video monitoring) 
or limited computational resources (such as 
mobile phones). In contrast, feature clustering is 
more suitable for batch or offline processing such 
as forensic applications and pre-processed galleries 
for watchlists. 

6 Conclusions and Future Work 

In this paper, we examined two approaches for improving 
the performance of a video-based face recognition 
system: face selection and face feature 
clustering. Three methods of face selection were 
investigated: face detection confidence, random se- 
lection and sequential selection. 

In comparing the three selection methods, it was 
found that: (i) using multiple faces is always better 
than using a single face alone, (ii) the face de- 
tection confidence metric typically provides better 
results when using a subset of faces, and (iii) the 
optimal number of faces to use varies drastically 
across selection methods and datasets (subsets of 
MOBIO). 

For feature clustering, we used a k-means approach 
and found that the optimal k varied across data- 
sets, and that more faces provided better cluster 
representations for recognition (i.e., clustering faces 
from multiple videos together is better than clus- 
tering from a single video). 

When compared to face selection, the lowest error 
rates were always obtained through clustering, at 
the expense of higher computational effort. With 
face selection, the cost of computing a selection metric 
is typically low. However, its major drawback is 
that the number of faces is selected in 
a heuristic manner, and as such is highly dependent 
on both dataset and face selection metric. The 
parameters of face selection are not likely to translate 
well across datasets, thus potentially giving 
sub-optimal results. 

With face feature clustering, the optimal number 
of clusters may vary across datasets; however, there 
are many principled methods of adaptive clustering 
to find the optimal number of clusters [17]. As 
such, the clustering approach is more robust and 
transferable across datasets. However, its main 
drawback is the computational effort required for 
face detection in all frames and the subsequent 
feature extraction. 

Based on the above trade-offs, our experiments 
suggest that designers of video-based recognition 
systems should use facial feature clustering if they 
are able to process videos in a batch fashion (off- 
line), as clustering can robustly maximise recog- 
nition accuracy. This would also be applicable 
to galleries of online systems as the galleries are 
typically processed in batch. In contrast, if the 
application has real-time requirements such as live 
video monitoring in surveillance, the selection of 
faces using a good face confidence metric may make 
the most sense. 

Though we investigated face selection and face feature 
clustering individually, it is still worth exploring 
the combination of these two techniques, 
such as clustering on selected faces. In addition, 
it would also be worthwhile to further examine 
face feature clustering on other face recognition 
techniques. 